Star join algorithm based on multi-dimensional Bloom filter in Spark
ZHOU Guoliang, SA Churila, ZHU Yongli
Journal of Computer Applications    2016, 36 (2): 353-357.   DOI: 10.11772/j.issn.1001-9081.2016.02.0353
To meet the demand for high-performance analysis of real-time data in On-Line Analytical Processing (OLAP) systems, a new star join algorithm suited to the Spark platform was proposed based on the Multi-Dimensional Bloom Filter (MDBF), namely SMDBFSJ (Spark Multi-Dimensional Bloom Filter Star Join). First, the MDBF was built from the dimension tables and, because of its small size, broadcast to all nodes. Then the fact table was filtered entirely on the local node, with no data movement between nodes. Finally, the filtered fact table was joined with the dimension tables using the repartition join model to obtain the final result. The SMDBFSJ algorithm avoids moving the fact table, reduces the size of the broadcast data by using MDBF, and fully combines the advantages of broadcast join and repartition join. Experimental results in stand-alone and cluster environments verify the effectiveness of SMDBFSJ: it achieves about three times the performance of repartition join in Spark.
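The steps described above (build per-dimension Bloom filters, broadcast them, filter the fact table locally, then repartition-join the surviving rows) can be illustrated with plain Spark RDD operations. The Scala sketch below is illustrative only: the toy Bloom filter, the column layout, the filter size and the HDFS paths are assumptions, not the authors' MDBF implementation.

import org.apache.spark.sql.SparkSession
import scala.util.hashing.MurmurHash3

// Toy Bloom filter: a fixed-size bit set probed with several seeded hash functions (illustration only).
class ToyBloomFilter(size: Int, hashes: Int) extends Serializable {
  private val bits = new java.util.BitSet(size)
  private def positions(key: String): Seq[Int] =
    (0 until hashes).map(i => (MurmurHash3.stringHash(key, i) % size + size) % size)
  def add(key: String): Unit = positions(key).foreach(p => bits.set(p))
  def mightContain(key: String): Boolean = positions(key).forall(p => bits.get(p))
}

object SmdbfsjSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SMDBFSJ-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Assumed inputs: a CSV fact table (dateKey, productKey, measure) and two dimension
    // tables whose first column is the join key.
    val fact       = sc.textFile("hdfs:///warehouse/fact").map(_.split(","))
    val dimDate    = sc.textFile("hdfs:///warehouse/dim_date").map(_.split(",")).map(r => (r(0), r))
    val dimProduct = sc.textFile("hdfs:///warehouse/dim_product").map(_.split(",")).map(r => (r(0), r))

    // Step 1: build one Bloom filter per dimension (together, the multi-dimensional filter)
    // and broadcast the small structure to every node.
    def buildFilter(keys: Array[String]): ToyBloomFilter = {
      val bf = new ToyBloomFilter(1 << 20, 3)
      keys.foreach(bf.add)
      bf
    }
    val mdbf = sc.broadcast(Map(
      "date"    -> buildFilter(dimDate.keys.collect()),
      "product" -> buildFilter(dimProduct.keys.collect())))

    // Step 2: filter the fact table locally on each node; rows without matching dimension keys
    // are dropped before any shuffle takes place.
    val filteredFact = fact.filter { r =>
      mdbf.value("date").mightContain(r(0)) && mdbf.value("product").mightContain(r(1))
    }

    // Step 3: repartition join of the much smaller filtered fact table with the dimension tables.
    val joined = filteredFact.map(r => (r(0), r)).join(dimDate)
      .map { case (_, (f, d)) => (f(1), (f, d)) }
      .join(dimProduct)

    joined.take(5).foreach(println)
    spark.stop()
  }
}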
Parallel cube computing in Spark
SA Churila, ZHOU Guoliang, SHI Lei, WANG Liuwang, SHI Xin, ZHU Yongli
Journal of Computer Applications    2016, 36 (2): 348-352.   DOI: 10.11772/j.issn.1001-9081.2016.02.0348
In view of the poor real-time response of traditional On-Line Analytical Processing (OLAP) when processing big data, how to accelerate the computation of data cubes on Spark was investigated, and a memory-based distributed computing framework was put forward. To improve the parallelism and performance of Bottom-Up Construction (BUC), a new data cube computation algorithm based on Spark and BUC, referred to as BUCPark (BUC on Spark), was designed. Moreover, to avoid the expansion of intermediate data cubes in memory during iteration, BUCPark was further improved into LBUCPark (Layered BUC on Spark), which takes full advantage of Spark's memory reuse and sharing mechanism. The experimental results show that LBUCPark outperforms both BUC and BUCPark in computing performance and can compute data cubes efficiently in the big data era.
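To make the cube computation concrete, the following Scala sketch materializes every cuboid of a small lattice on Spark with a single flatMap and reduceByKey pass. It is a naive full-cube computation shown only for illustration: it implements neither BUC's bottom-up pruning nor the layered memory reuse of BUCPark/LBUCPark, and the input layout and path are assumptions.

import org.apache.spark.sql.SparkSession

object CubeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cube-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input: CSV rows of (dim1, dim2, dim3, measure).
    val rows = sc.textFile("hdfs:///warehouse/fact").map(_.split(",")).map { r =>
      (Array(r(0), r(1), r(2)), r(3).toDouble)
    }

    val numDims = 3
    // Each bit mask selects a subset of dimensions, i.e. one cuboid of the lattice;
    // "*" marks a rolled-up dimension.
    val cuboids = 0 until (1 << numDims)

    val cube = rows.flatMap { case (dims, measure) =>
      cuboids.map { mask =>
        val key = dims.zipWithIndex.map { case (v, i) =>
          if ((mask & (1 << i)) != 0) v else "*"
        }.mkString("|")
        (key, measure)
      }
    }.reduceByKey(_ + _)   // one aggregate per (cuboid, group) pair, computed in parallel

    cube.take(10).foreach(println)
    spark.stop()
  }
}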
Parallel fuzzy C-means clustering algorithm in Spark
WANG Guilan, ZHOU Guoliang, SA Churila, ZHU Yongli
Journal of Computer Applications    2016, 36 (2): 342-347.   DOI: 10.11772/j.issn.1001-9081.2016.02.0342
With growing data volumes and tighter timeliness requirements, clustering algorithms need to adapt to big data and deliver higher performance. A new algorithm, Spark-FCM, was proposed based on Fuzzy C-Means (FCM) and the Spark distributed in-memory computing platform. Firstly, the data matrix was partitioned horizontally into vector sets and stored distributedly, so that different vectors resided on different nodes. Then, according to the characteristics of the FCM algorithm, the matrix operations, including multiplication, addition and transpose, were redesigned for distributed storage and cache sensitivity. Finally, the Spark-FCM algorithm, which combines these matrix operations with the Spark platform, was implemented. The primary data structures of the algorithm adopt distributed matrix storage, with little data movement between nodes and distributed computation in each step. Test results in stand-alone and cluster environments show that Spark-FCM has good scalability and can handle large-scale data sets: its performance scales linearly with the size of the data, and the performance in the cluster environment is 2 to 3 times that of the stand-alone environment.
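The distributed update performed in each FCM iteration can be illustrated with a short Scala sketch on Spark: the current centroids are broadcast, every node computes fuzzy memberships for its local points, and the membership-weighted sums are aggregated to produce the new centroids. This shows the general scheme only, not the paper's horizontal matrix partitioning or cache-sensitive matrix operators; the input path, the fuzzifier m = 2, the cluster count and the iteration count are assumptions.

import org.apache.spark.sql.SparkSession

object FcmSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fcm-sketch").getOrCreate()
    val sc = spark.sparkContext

    val m = 2.0            // fuzzifier
    val c = 3              // number of clusters
    val iterations = 10

    // Assumed input: one numeric vector per line, comma separated.
    val points = sc.textFile("hdfs:///points")
      .map(_.split(",").map(_.toDouble)).cache()

    def dist(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    var centroids = points.takeSample(withReplacement = false, c)

    for (_ <- 1 to iterations) {
      val bc = sc.broadcast(centroids)
      // For each point: fuzzy membership u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)), then
      // accumulate u^m-weighted point sums and weight sums per cluster.
      val (weightedSums, weights) = points.map { x =>
        val d = bc.value.map(cj => math.max(dist(x, cj), 1e-10))
        val u = d.map(dj => 1.0 / d.map(dk => math.pow(dj / dk, 2.0 / (m - 1.0))).sum)
        val w = u.map(math.pow(_, m))
        (w.map(wj => x.map(_ * wj)), w)
      }.reduce { case ((s1, w1), (s2, w2)) =>
        (s1.zip(s2).map { case (a, b) => a.zip(b).map(t => t._1 + t._2) },
         w1.zip(w2).map(t => t._1 + t._2))
      }
      centroids = weightedSums.zip(weights).map { case (s, w) => s.map(_ / w) }
      bc.destroy()
    }

    centroids.foreach(cj => println(cj.mkString(",")))
    spark.stop()
  }
}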
Real-time clustering for massive data using Storm
WANG Mingkun, YUAN Shaoguang, ZHU Yongli, WANG Dewen
Journal of Computer Applications    2014, 34 (11): 3078-3081.   DOI: 10.11772/j.issn.1001-9081.2014.11.3078

In order to improve the real-time response capability of massive data processing, the Storm distributed real-time computing platform was introduced into data mining, and a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm based on Storm was designed to handle massive data. The algorithm was divided into three main steps: data collection, clustering analysis and result output, all of which were implemented with Storm's predefined components and submitted to the Storm cluster for execution. Comparative analysis and performance monitoring show that the system achieves low latency and high throughput, which proves that Storm is well suited to real-time processing of massive data.
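For reference, the clustering-analysis step can be pictured as a DBSCAN pass over a buffered window of points, roughly what the clustering component would run on the tuples it receives. The self-contained Scala sketch below is illustrative only: the Storm spout/bolt wiring, the windowing of the stream and the eps/minPts values are assumptions, not the paper's implementation.

// Plain DBSCAN over one window of points; -1 marks noise, positive numbers are cluster ids.
object DbscanSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def dbscan(points: Vector[Point], eps: Double, minPts: Int): Array[Int] = {
    val labels = Array.fill(points.length)(0)   // 0 = unvisited, -1 = noise, >0 = cluster id
    var cluster = 0
    def neighbors(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    for (i <- points.indices if labels(i) == 0) {
      val seeds = neighbors(i)
      if (seeds.size < minPts) labels(i) = -1          // not a core point: provisionally noise
      else {
        cluster += 1
        labels(i) = cluster
        val queue = scala.collection.mutable.Queue(seeds: _*)
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (labels(j) == -1) labels(j) = cluster     // border point reclaimed from noise
          if (labels(j) == 0) {
            labels(j) = cluster
            val jn = neighbors(j)
            if (jn.size >= minPts) queue ++= jn        // expand only from core points
          }
        }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    val window = Vector(Array(1.0, 1.0), Array(1.1, 0.9), Array(8.0, 8.0), Array(8.1, 8.2), Array(50.0, 50.0))
    println(dbscan(window, eps = 1.0, minPts = 2).mkString(","))   // prints 1,1,2,2,-1
  }
}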
